NSF PAR Search | NSF Public Access Repository


Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

Mai, Zheda; Zhang, Ping; Tu, Cheng-Hao; Chen, Hong-You; Zhang, Li; Chao, Wei-Lun (June 2025, IEEE)

Full Text Available
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

Wan, Zhongwei; Shen, Hui; Wang, Xin; Liu, Che; Mai, Zheda; Zhang, Mi (April 2025, NAACL)

Full Text Available
Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Chowdhury, Arpita; Paul, Dipanjyoti; Mai, Zheda; Gu, Jianyang; Zhang, Ziheng; Mehrab, Kazi Sajeed; Campolongo, Elizabeth G; Rubenstein, Daniel; Stewart, Charles V; Karpatne, Anuj; et al (June 2025, Proceedings of the Computer Vision and Pattern Recognition Conference)

We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a "free lunch," requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.
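The mechanism described in this abstract (one prompt per class, classification from the prompt outputs, and attention maps read off the predicted class's prompt) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, random untrained weights, and a hypothetical shared `score_vector` that turns each prompt's output into a class logit. The names `prompt_cam_head`, `class_prompts`, and `score_vector` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cam_head(patches, class_prompts, score_vector):
    """Toy single-head version: each class prompt attends over image patches.

    Assumption: one shared score vector maps every prompt output to a logit.
    """
    scores = class_prompts @ patches.T / np.sqrt(patches.shape[-1])
    attn = softmax(scores, axis=-1)       # (num_classes, num_patches)
    prompt_out = attn @ patches           # (num_classes, dim)
    logits = prompt_out @ score_vector    # (num_classes,)
    return logits, attn

rng = np.random.default_rng(0)
num_classes, num_patches, dim = 3, 6, 4
patches = rng.normal(size=(num_patches, dim))          # stand-in patch features
class_prompts = rng.normal(size=(num_classes, dim))    # one prompt per class
score_vector = rng.normal(size=dim)                    # hypothetical shared scorer

logits, attn = prompt_cam_head(patches, class_prompts, score_vector)
predicted = int(np.argmax(logits))
trait_map = attn[predicted]  # attention of the predicted class's prompt over patches
```

In this sketch, `trait_map` plays the role of the class attention map: reshaped to the patch grid, it would indicate which patches the predicted class's prompt relied on.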
Full Text Available
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Kil, Jihyung; Mai, Zheda; Lee, Justin; Chowdhury, Arpita; Wang, Zihe; Cheng, Kerrie; Wang, Lemeng; Liu, Ye; Chao, Wei-Lun (December 2024, Advances in Neural Information Processing Systems 37 (NeurIPS 2024))

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-COMPBENCH, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). MLLM-COMPBENCH mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use MLLM-COMPBENCH to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe MLLM-COMPBENCH not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.
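The abstract mentions CLIP similarity scores as one signal for curating image pairs but does not say how they are applied. One plausible reading, sketched below, is to keep pairs whose embedding cosine similarity exceeds a threshold. The embeddings are hand-made stand-ins for CLIP image embeddings, and both the `similar_pairs` helper and the threshold value are hypothetical.

```python
import numpy as np

def similar_pairs(embeddings, threshold):
    """Return (i, j, cosine_similarity) for every pair i < j at or above threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)  # each unordered pair once
    keep = sim[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]

# Stand-in embeddings: items 0 and 1 point in nearly the same direction.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = similar_pairs(embeddings, threshold=0.9)  # hypothetical threshold
```

With these toy vectors only the pair (0, 1) survives the threshold, which mirrors the idea of pairing visually similar images so that the comparison question is non-trivial.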
Full Text Available
COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs

Kil, Jihyung; Mai, Zheda; Lee, Justin; Wang, Zihe; Cheng, Kerrie; Wang, Lemeng; Liu, Ye; Chowdhury, Arpita; Chao, Wei-Lun (December 2024, NeurIPS)

Full Text Available
Fine-Tuning is Fine, if Calibrated

Mai, Zheda; Chowdhury, Arpita; Zhang, Ping; Tu, Cheng-Hao; Chen, Hong-You; Pahuja, Vardaan; Berger-Wolf, Tanya; Gao, Song; Stewart, Charles; Su, Yu; et al (December 2024, NeurIPS)

Full Text Available
Segment Anything Model (SAM) Enhances Pseudo-Labels for Weakly Supervised Semantic Segmentation

Chen, Tianle; Mai, Zheda; Li, Ruiwen; Chao, Wei-Lun (December 2023, Conference on Neural Information Processing Systems (Workshop))

Full Text Available
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning

Tu, Cheng-Hao; Mai, Zheda; Chao, Wei-Lun (July 2023, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Intermediate features of a pre-trained model have been shown informative for making accurate predictions on downstream tasks, even if the model backbone is kept frozen. The key challenge is how to utilize these intermediate features given their gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. Through introducing a handful of learnable “query” tokens to each layer, VQT leverages the inner workings of Transformers to “summarize” rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through the entire backbone. This also suggests the complementary role between VQT and those approaches in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches to attain even higher accuracy, making it a simple add-on to further boost transfer learning. Code is available at https://github.com/andytu28/VQT.
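The claim that the query tokens summarize a layer's features while leaving those features intact can be illustrated with a toy single-head attention layer. This is a sketch under stated assumptions, not the released code: it assumes the extra tokens are added on the query side only (keys and values come from the frozen features alone), uses random untrained weights, and the function name `attention_with_queries` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_queries(features, query_tokens, w_q, w_k, w_v):
    """Single-head attention where the extra tokens act as queries only."""
    all_queries = np.concatenate([features, query_tokens], axis=0) @ w_q
    keys = features @ w_k      # keys come from the frozen features only
    values = features @ w_v    # values come from the frozen features only
    scores = all_queries @ keys.T / np.sqrt(keys.shape[-1])
    out = softmax(scores) @ values
    n = features.shape[0]
    return out[:n], out[n:]    # (feature outputs, per-query summaries)

rng = np.random.default_rng(0)
dim, n_patches, n_queries = 8, 5, 2
features = rng.normal(size=(n_patches, dim))        # stand-in frozen features
query_tokens = rng.normal(size=(n_queries, dim))    # stand-in learnable queries
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

feat_out, summaries = attention_with_queries(features, query_tokens, w_q, w_k, w_v)
# Same layer with no query tokens, for comparison.
baseline, _ = attention_with_queries(features, np.zeros((0, dim)), w_q, w_k, w_v)
```

Because each attention row is computed independently and the added tokens never serve as keys or values, `feat_out` matches `baseline`. Only `summaries` depends on the query tokens, so in this toy setup gradients for the query tokens would not need to flow through the feature outputs.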
Full Text Available
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning

https://doi.org/10.1109/cvpr52729.2023.00746

Tu, Cheng-Hao; Mai, Zheda; Chao, Wei-Lun (June 2023, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Full Text Available
Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

Tu, Cheng-Hao; Chen, Hong-You; Mai, Zheda; Zhong, Jike; Pahuja, Vardaan; Berger-Wolf, Tanya; Gao, Song; Stewart, Charles; Su, Yu; Chao, Wei-Lun (February 2024, NeurIPS)

Full Text Available
